49 research outputs found

    Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields

    Get PDF
    Motivation: The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome

    Gene Function Classification Using Bayesian Models with Hierarchy-Based Priors

    Get PDF
    We investigate the application of hierarchical classification schemes to the annotation of gene function based on several characteristics of protein sequences including phylogenic descriptors, sequence based attributes, and predicted secondary structure. We discuss three Bayesian models and compare their performance in terms of predictive accuracy. These models are the ordinary multinomial logit (MNL) model, a hierarchical model based on a set of nested MNL models, and a MNL model with a prior that introduces correlations between the parameters for classes that are nearby in the hierarchy. We also provide a new scheme for combining different sources of information. We use these models to predict the functional class of Open Reading Frames (ORFs) from the E. coli genome. The results from all three models show substantial improvement over previous methods, which were based on the C5 algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining these sources of information, our approach results in a higher accuracy rate when compared to models that use each data source alone. Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information

    Multi-Target Prediction: A Unifying View on Problems and Methods

    Full text link
    Multi-target prediction (MTP) is concerned with the simultaneous prediction of multiple target variables of diverse type. Due to its enormous application potential, it has developed into an active and rapidly expanding research field that combines several subfields of machine learning, including multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. In this paper, we present a unifying view on MTP problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods. This is accomplished by identifying a number of key properties, which distinguish such methods and determine their suitability for different types of problems. Finally, we also discuss a few challenges for future research

    Incorporating functional inter-relationships into protein function prediction algorithms

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Functional classification schemes (e.g. the Gene Ontology) that serve as the basis for annotation efforts in several organisms are often the source of gold standard information for computational efforts at supervised protein function prediction. While successful function prediction algorithms have been developed, few previous efforts have utilized more than the protein-to-functional class label information provided by such knowledge bases. For instance, the Gene Ontology not only captures protein annotations to a set of functional classes, but it also arranges these classes in a DAG-based hierarchy that captures rich inter-relationships between different classes. These inter-relationships present both opportunities, such as the potential for additional training examples for small classes from larger related classes, and challenges, such as a harder to learn distinction between similar GO terms, for standard classification-based approaches.</p> <p>Results</p> <p>We propose a method to enhance the performance of classification-based protein function prediction algorithms by addressing the issue of using these interrelationships between functional classes constituting functional classification schemes. Using a standard measure for evaluating the semantic similarity between nodes in an ontology, we quantify and incorporate these inter-relationships into the <it>k</it>-nearest neighbor classifier. We present experiments on several large genomic data sets, each of which is used for the modeling and prediction of over hundred classes from the GO Biological Process ontology. The results show that this incorporation produces more accurate predictions for a large number of the functional classes considered, and also that the classes benefitted most by this approach are those containing the fewest members. In addition, we show how our proposed framework can be used for integrating information from the entire GO hierarchy for improving the accuracy of predictions made over a set of base classes. Finally, we provide qualitative and quantitative evidence that this incorporation of functional inter-relationships enables the discovery of interesting biology in the form of novel functional annotations for several yeast proteins, such as Sna4, Rtn1 and Lin1.</p> <p>Conclusion</p> <p>We implemented and evaluated a methodology for incorporating interrelationships between functional classes into a standard classification-based protein function prediction algorithm. Our results show that this incorporation can help improve the accuracy of such algorithms, and help uncover novel biology in the form of previously unknown functional annotations. The complete source code, a sample data set and the additional files for this paper are available free of charge for non-commercial use at <url>http://www.cs.umn.edu/vk/gaurav/functionalsimilarity/</url>.</p

    Bayesian Markov Random Field Analysis for Protein Function Prediction Based on Network Data

    Get PDF
    Inference of protein functions is one of the most important aims of modern biology. To fully exploit the large volumes of genomic data typically produced in modern-day genomic experiments, automated computational methods for protein function prediction are urgently needed. Established methods use sequence or structure similarity to infer functions but those types of data do not suffice to determine the biological context in which proteins act. Current high-throughput biological experiments produce large amounts of data on the interactions between proteins. Such data can be used to infer interaction networks and to predict the biological process that the protein is involved in. Here, we develop a probabilistic approach for protein function prediction using network data, such as protein-protein interaction measurements. We take a Bayesian approach to an existing Markov Random Field method by performing simultaneous estimation of the model parameters and prediction of protein functions. We use an adaptive Markov Chain Monte Carlo algorithm that leads to more accurate parameter estimates and consequently to improved prediction performance compared to the standard Markov Random Fields method. We tested our method using a high quality S.cereviciae validation network with 1622 proteins against 90 Gene Ontology terms of different levels of abstraction. Compared to three other protein function prediction methods, our approach shows very good prediction performance. Our method can be directly applied to protein-protein interaction or coexpression networks, but also can be extended to use multiple data sources. We apply our method to physical protein interaction data from S. cerevisiae and provide novel predictions, using 340 Gene Ontology terms, for 1170 unannotated proteins and we evaluate the predictions using the available literature

    Building multiclass classifiers for remote homology detection and fold recognition

    Get PDF
    BACKGROUND: Protein remote homology detection and fold recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems and they have not been extensively used to solve the more general multiclass remote homology prediction and fold recognition problems. RESULTS: We present a comprehensive evaluation of a number of methods for building SVM-based multiclass classification schemes in the context of the SCOP protein classification. These methods include schemes that directly build an SVM-based multiclass model, schemes that employ a second-level learning approach to combine the predictions generated by a set of binary SVM-based classifiers, and schemes that build and combine binary classifiers for various levels of the SCOP hierarchy beyond those defining the target classes. CONCLUSION: Analyzing the performance achieved by the different approaches on four different datasets we show that most of the proposed multiclass SVM-based classification approaches are quite effective in solving the remote homology prediction and fold recognition problems and that the schemes that use predictions from binary models constructed for ancestral categories within the SCOP hierarchy tend to not only lead to lower error rates but also reduce the number of errors in which a superfamily is assigned to an entirely different fold and a fold is predicted as being from a different SCOP class. Our results also show that the limited size of the training data makes it hard to learn complex second-level models, and that models of moderate complexity lead to consistently better results

    Directing Experimental Biology: A Case Study in Mitochondrial Biogenesis

    Get PDF
    Computational approaches have promised to organize collections of functional genomics data into testable predictions of gene and protein involvement in biological processes and pathways. However, few such predictions have been experimentally validated on a large scale, leaving many bioinformatic methods unproven and underutilized in the biology community. Further, it remains unclear what biological concerns should be taken into account when using computational methods to drive real-world experimental efforts. To investigate these concerns and to establish the utility of computational predictions of gene function, we experimentally tested hundreds of predictions generated from an ensemble of three complementary methods for the process of mitochondrial organization and biogenesis in Saccharomyces cerevisiae. The biological data with respect to the mitochondria are presented in a companion manuscript published in PLoS Genetics (doi:10.1371/journal.pgen.1000407). Here we analyze and explore the results of this study that are broadly applicable for computationalists applying gene function prediction techniques, including a new experimental comparison with 48 genes representing the genomic background. Our study leads to several conclusions that are important to consider when driving laboratory investigations using computational prediction approaches. While most genes in yeast are already known to participate in at least one biological process, we confirm that genes with known functions can still be strong candidates for annotation of additional gene functions. We find that different analysis techniques and different underlying data can both greatly affect the types of functional predictions produced by computational methods. This diversity allows an ensemble of techniques to substantially broaden the biological scope and breadth of predictions. We also find that performing prediction and validation steps iteratively allows us to more completely characterize a biological area of interest. While this study focused on a specific functional area in yeast, many of these observations may be useful in the contexts of other processes and organisms
    corecore